1  Introduction

The Naval Research Laboratory’s (NRL) Code 7440 Production Enhancement Team at
the Stennis Space Center has been tasked to develop ways to speed up hydrographic data
processing at the Naval Oceanographic Office (NAVOCEANO) [Depn02]. This paper
presents the final development of a parallelized version of the Pfmloader application
customized to run on a Beowulf cluster.

This paper specifically covers software development efforts in early FY02 [Sam02]. In a
series of algorithms, called Schemes D, E, and F, a parallel algorithm previously
implemented in Scheme C (see Figure 1) has been integrated with an improved version
of NAVOCEANO’s Pure File Magic (PFM) library. Earlier versions of this software,
namely Schemes A and B, are not discussed here; they were used as a design platform
for the software architecture found in Scheme C. The main goal of this work was to
increase the rate at which binned data are written to a physical disk. The final software
version is Scheme F.

The  final  parallel  code  of  Scheme  F  achieves  the  best  speedup  for  the  largest  available 
test  dataset.  The  simple  runs  (no  filtering)  exhibited  a  speedup  of  10  as  compared  to  the 
original,  serial  algorithm.  Runs  with  swath  filtering  showed  a  top  speedup  of  8.  Runs 
with  area  filtering  reached  a  speedup  of  6.5.  Runs  with  both  swath  and  area  filtering 
showed a top speedup of 7. In all cases, the data strongly suggest that greater speedups
could be achieved for input datasets larger than those tested.

Section  2  of  this  paper  is  devoted  to  Scheme  D.  Section  3  presents  results  on  an 
implemented  threaded  version  of  Scheme  D.  Section  4  deals  with  results  for  Scheme  E. 
Section  5  describes  Scheme  F.  Detailed  results  are  presented  on  the  performance  of 
Scheme  F  when  swath  and  area  based  filtering  are  enabled.  Section  6  presents  results 
illustrating  the  effect  of  the  physical  disk’s  I/O  throughput  rates  on  speedup  results. 
Section  7  discusses  results  and  future  plans. 


2  Scheme  D 

In Scheme D, as in Scheme C, each node is assigned one of the following roles: master
node, I/O node, or slave node. In general, more of the functionality of the original
PFM library has been assigned to the slave nodes in Scheme D than in earlier schemes.

•  Each  slave  node  sorts  read-in  sounding  data  according  to  bin  index  and 
compresses  them  in  the  same  way  as  is  done  in  the  original  PFM  library.  The 
sending  buffer  is  now  in  the  form  of  a  character  string. 

•  The I/O node receives sounding data in the PFM compressed form and copies
them into PFM-style depth blocks, each consisting of six sounding values and a
continuation pointer.

•  The  master  node  role  has  not  been  changed  from  Scheme  C. 
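
The division of labor in the list above can be sketched as follows. This is a minimal illustration, not the PFM library's actual data layout: the names `DepthBlock`, `NO_NEXT`, `sort_by_bin`, and `load` are hypothetical, soundings are reduced to `(bin_index, depth)` pairs, and the PFM compression step is omitted. Only the block capacity of six and the presence of a continuation pointer come from the text.

```python
from dataclasses import dataclass, field

BLOCK_CAPACITY = 6  # soundings per depth block, as stated in the text
NO_NEXT = -1        # hypothetical sentinel meaning "no continuation block"


@dataclass
class DepthBlock:
    bin_index: int
    soundings: list = field(default_factory=list)
    next_block: int = NO_NEXT  # continuation pointer (index of the next block)


def sort_by_bin(soundings):
    """Slave-side step: order (bin_index, depth) pairs by bin index."""
    return sorted(soundings, key=lambda s: s[0])


def append_to_blocks(blocks, chain_tail, bin_index, depth):
    """I/O-side step: store one sounding in the last block of its bin,
    chaining a new block through the continuation pointer when full."""
    tail = chain_tail.get(bin_index)
    if tail is None or len(blocks[tail].soundings) == BLOCK_CAPACITY:
        blocks.append(DepthBlock(bin_index))
        new_index = len(blocks) - 1
        if tail is not None:
            blocks[tail].next_block = new_index
        chain_tail[bin_index] = new_index
        tail = new_index
    blocks[tail].soundings.append(depth)


def load(soundings):
    """Sort the soundings, then pack them into chained depth blocks."""
    blocks, chain_tail = [], {}
    for bin_index, depth in sort_by_bin(soundings):
        append_to_blocks(blocks, chain_tail, bin_index, depth)
    return blocks
```

With eight soundings in one bin, the sketch produces a full block of six whose continuation pointer refers to a second block holding the remaining two.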

2.1  Results

Figure 1: Scheme C (the precursor to the schemes presented in this paper)

Figure 2 illustrates achieved speedups for different loads. The test datasets are denoted,
as previously, by “L” (the Liberty dataset), “12” (12 files), “24” (24 files), “48” (48 files),
and “74” (74 files). Timings have been averaged over four runs. Speedup figures are
obtained by comparing parallel and serial runs. In the case of algorithm speedups, the
comparisons are made with corresponding timings for three Message Passing Interface
(MPI) processes. Speedups exhibited by Scheme D reach about 7 for the four larger
dataset loads. For the smallest dataset, Scheme D shows a speedup of around 5, cutting
the run time from around 2 min to 23 sec. For the largest input dataset tested, the
measured speedup is about sevenfold, reducing execution time from around 24 min to
3 min 20 sec. The optimal number of MPI processes is 9 for this scheme.
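
The speedup figures quoted above follow from dividing the serial run time by the parallel run time. The short check below uses the rounded timings given in the text, so its results are approximate; the function name `speedup` is ours.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup is the serial run time divided by the parallel run time."""
    return serial_seconds / parallel_seconds


# Rounded timings quoted in the text for Scheme D.
smallest = speedup(2 * 60, 23)           # about 2 min down to 23 sec
largest = speedup(24 * 60, 3 * 60 + 20)  # about 24 min down to 3 min 20 sec

print(f"smallest dataset: {smallest:.1f}x, largest dataset: {largest:.1f}x")
```

Both values are consistent with the "around 5" and "about 7" figures reported for the smallest and largest datasets.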

3  Threading 

3.1  Initial  Plan 

The overall processing flow can be improved by grouping tasks into at least two threads
on the I/O node and on a slave node. Tasks related to MPI communication can be
accomplished by a separate thread. A test program mixing MPI and threading
confirmed that MPI Chameleon (MPICH) allows only a single thread to execute MPI calls.
On the I/O node, a separate communication thread would take care of receiving buffers.
The other thread would be responsible for writing data to the PFM output file. On the
slave node, a communication thread would send full buffers to the I/O node. The second
thread would transfer and process sounding data from input files to buffers. Tests are
planned to check whether the currently used dual-CPU system boards will efficiently
support two application threads. A quad-CPU system board for the I/O node could be
used if that proved beneficial.

3.2  Threading  the  I/O  Node:  Algorithm 

Figure 2: Achieved speedup of Scheme D for different loads.

Only one part of the above plan has been implemented: Scheme D has been tested using
Linux POSIX threads on the I/O node only. Processing on the I/O node has been separated
into two threads. The main thread does all necessary initialization and then creates one
additional thread, called the I/O processing thread, which submits data to the output
PFM file. The main thread then acts as the I/O communication thread, and all MPI
function calls are made only by this thread. The I/O communication thread receives
buffers from worker nodes and submits them to a work queue. The I/O processing thread
checks the work queue for any full buffers and uses PFM function calls to write their
data to the PFM output file. Each emptied buffer is returned to the work queue area. The
algorithm reuses buffers in order to avoid reinitializing them. All access to the work
queue is safeguarded by a “mutex lock” (Mutually Exclusive Access Lock) mechanism, a
standard tool available in the thread library. A small set of empty buffers is initialized
in advance on the I/O communication side. If all of these empty buffers are in use, the
I/O communication thread enters the work queue and takes the empty buffers found
there. If no empty buffers are available in the work queue area, the I/O communication
thread initializes a fresh buffer. Only if initializing the fresh buffer fails does the I/O
communication thread wait for the I/O processing thread to return a buffer. The creation
of additional buffers is always expected because PFM library operations are the slowest
part of processing (these library operations require multiple accesses to the physical
disk drive).
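
The two-thread arrangement described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the original used POSIX threads with MPI and PFM calls, whereas this sketch uses Python's `threading` module, replaces the MPI receive with iteration over a list of payloads, and replaces the PFM write with an append to a list. All names (`WorkQueue`, `communicate`, `process`, `run_io_node`) are hypothetical. The sketch takes one empty buffer at a time from the queue, and the final fallback of waiting for a returned buffer when allocation fails is left out.

```python
import threading
import time
from collections import deque


class WorkQueue:
    """Full and empty buffers shared by both threads, guarded by one mutex."""

    def __init__(self):
        self.lock = threading.Lock()
        self.full = deque()
        self.empty = deque()
        self.done = False  # set once the last buffer has been queued


def communicate(queue, incoming, spare):
    """I/O communication thread: the only place MPI calls would be made."""
    for payload in incoming:  # stands in for receiving a buffer over MPI
        if spare:
            buf = spare.pop()  # use a pre-initialized empty buffer
        else:
            with queue.lock:
                buf = queue.empty.popleft() if queue.empty else None
            if buf is None:
                buf = bytearray()  # initialize a fresh buffer
        buf[:] = payload
        with queue.lock:
            queue.full.append(buf)
    with queue.lock:
        queue.done = True


def process(queue, written):
    """I/O processing thread: stands in for the PFM write calls."""
    while True:
        with queue.lock:
            buf = queue.full.popleft() if queue.full else None
            finished = buf is None and queue.done
        if finished:
            return
        if buf is None:
            time.sleep(0.001)  # nothing queued yet; check again shortly
            continue
        written.append(bytes(buf))  # stands in for writing to the PFM file
        with queue.lock:
            queue.empty.append(buf)  # return the buffer for reuse


def run_io_node(incoming, spare_count=2):
    """Main thread: start the processing thread, then do the communication."""
    queue, written = WorkQueue(), []
    worker = threading.Thread(target=process, args=(queue, written))
    worker.start()
    communicate(queue, incoming, [bytearray() for _ in range(spare_count)])
    worker.join()
    return written
```

Calling `run_io_node` with a list of byte strings returns them in arrival order, since a single consumer drains a first-in, first-out queue. Note that nothing in this sketch limits how many fresh buffers are created.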

3.3  Threading  the  I/O  Node:  Results 

Figure  3  illustrates  achieved  speedups  for  different  loads.  The  test  datasets  are  denoted, 
as  previously,  by  “L”  (the  Liberty  dataset),  “12”  (12  files),  “24”  (24  files),  “48”  (48  files), 
and “74” (74 files). Speedup values for the Scheme D-Threaded version are noticeably
smaller than those for the original Scheme D. Such results could be attributed to the
additional overhead of the thread library. However, results for the largest test dataset
(“74”) are especially disappointing, due to the lack of control over memory usage on the
I/O node in the threaded code. This preliminary threaded version places no limit on the
number of buffers used to hold incoming data on the I/O node. Since worker nodes deliver
buffers much faster than the I/O node can write them to a (slow) physical disk,
incoming buffers forced the operating system on the I/O node to use disk swap space,
causing a significant slowdown in processing.


[Figure 3 plot, panel title: Threaded Scheme D Speedup]

Figure 3: Achieved speedups of Scheme D-Threaded for different loads

4  Scheme  E

4.1  Grouping  Sounding  Data  into  Packages  of  6 

Figure 4: Achieved speedup of Scheme E for different loads

The improved version of the PFM library, like the original library, writes data to
the physical disk in fixed-length blocks. Each block can hold up to six sounding values
(the value of six is configurable). To take advantage of these blocks, sounding data are
sent in groups of six with the same bin index. At first, slave nodes send depths grouped
into packets of six (with the same bin index) to create full depth records in PFM style.
When all input file names have been distributed (and most of the files have already been
processed), the finishing slave nodes have some leftover depth data. On its own, each
node cannot create a final full depth record containing six sounding values for one bin
index. Under the direction of the master node, the first slave node to finish processing
becomes a sorting and grouping node. The other slave nodes then send their leftover data
to that sorting node for consolidation. Subsequently, full records are sent to the sorting
and grouping node (the one which currently is accepting data) with the final leftover data
sent to the I/O node. This additional effort of sorting into groups of six depths has the
advantage of increasing the speed of writing the data to disk as well as avoiding any need
to reread and rewrite partially empty depth records. Further improvements include a new
interface function to the PFM library to accommodate direct writing of blocks to the PFM
file. Also, slave nodes are given the additional task of creating complete depth blocks.
This improvement reduces the I/O node’s task to simply updating the continuation
pointers before writing data to disk.
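
The grouping and consolidation steps described above can be sketched as follows. This is a simplified illustration: soundings are reduced to `(bin_index, depth)` pairs, message passing between nodes is replaced by plain function calls, and the names `split_full_and_leftover` and `consolidate` are hypothetical rather than taken from the PFM library or the Pfmloader code.

```python
from collections import defaultdict

PACKET_SIZE = 6  # soundings per full depth record, as stated in the text


def split_full_and_leftover(soundings):
    """One slave node: group soundings by bin index and cut each group
    into full packets of six, returning (full_packets, leftovers)."""
    by_bin = defaultdict(list)
    for bin_index, depth in soundings:
        by_bin[bin_index].append(depth)

    full, leftover = [], []
    for bin_index in sorted(by_bin):
        depths = by_bin[bin_index]
        cut = len(depths) - len(depths) % PACKET_SIZE
        for start in range(0, cut, PACKET_SIZE):
            full.append((bin_index, depths[start:start + PACKET_SIZE]))
        leftover.extend((bin_index, d) for d in depths[cut:])
    return full, leftover


def consolidate(leftovers_per_node):
    """Sorting and grouping node: merge the leftovers received from the
    other slave nodes and regroup them into packets of six."""
    merged = [s for node in leftovers_per_node for s in node]
    return split_full_and_leftover(merged)
```

For example, if one node is left with two soundings for a bin and another node with four soundings for the same bin, neither can form a full record alone, but after consolidation they yield one complete packet of six and no remaining partial record.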

4.2  Results 

Figure 4 illustrates achieved speedups for different loads, compared with the original
serial application. The test datasets are denoted, as previously, by “L” (the Liberty
dataset), “12” (12 files), “24” (24 files), “48” (48 files), and “74” (74 files). The timings
have been averaged over four runs. A comparison between speedup numbers for
Schemes D and E shows that values for Scheme E are noticeably smaller, except for the
largest test dataset. For the smallest dataset, Scheme E shows a speedup of around 4.5,
reducing the run time from around 2 min to 26 sec. For the largest input dataset tested,
the measured speedup is around 7, reducing execution time from around 24 min to 3 min
19 sec. The optimal number of MPI processes is 10 for larger datasets in Scheme E.

4.3  Filtering,  Recomputing  Steps  and  Final  Tune-up 

The original filtering functionality works well in the Beowulf cluster environment. The
original swath filtering was handled separately for each input file. Thus the new
distributed scheme of processing input files by a group of slave nodes has no effect on
swath filtering. The same has been found for the recomputing step and area-based
filtering, which run without any changes to the code because they are executed in the
serial phase, on the I/O node, only after the PFM file has been created.


[Figure 5 plot, panel title: Speedup without Filtering; x-axis: MPI processes; series: 7 files, 12, 24, 48, 74]

Figure 5: Achieved speedups of Scheme F with no Filtering

5  Scheme  F

A tuned version of the new PFM library was used in Scheme F. Scheme F was tested
with four different setups, involving two possible filtering procedures: swath filtering,
area filtering, both swath and area filtering, and no filtering (called “simple runs”). Each
of these four setups has been tested with the standard five data test loads: “L” (the Liberty
dataset), “12” (12 files), “24” (24 files), “48” (48 files), and “74” (74 files). Results with
no filtering are illustrated in Figure 5.

[Figure 6 plot, panel title: Speedup with Area Filtering; x-axis: MPI Processes (3 to 16); series: 7 files, 12, 24, 48, 74]

Figure 6: Scheme F with Area Filtering

It was found that Scheme F achieves the best speedup for the largest available test
dataset. The simple runs (no filtering) exhibited a speedup of 10. Runs with swath
filtering enabled showed a top speedup of 8. Runs with area filtering reached a speedup
of 6.5. Runs with both swath and area filtering enabled showed a top speedup of 7. In the
simple runs, swath filtering and area-based filtering are turned off. Figure 6 illustrates
achieved speedups for different loads and filter processing. For the smallest dataset, when
swath filtering is turned on, worker nodes filter the sounding data by swath before
assembling them and sending them to the I/O node. Since swath filtering puts additional
processing onto worker nodes, such code behavior is to be expected. Worker nodes
perform swath filtering on sounding data read from the input files. When area filtering is
also turned on, the I/O node also performs area filtering of sounding data. Since area
filtering is done exclusively by the I/O node after all sounding data have been written to
the PFM file, the area filtering is performed in the serial phase of overall processing.

Figure 8: Scheme F performance with combinations of filters.

6  The  Role  of  Hard  Drive  Performance 

This section presents arguments to support the claim that the Pfmloader application is
I/O bound. When Scheme D was developed, upgrading the BIOS on each Beowulf
cluster node resulted in a significant improvement in the throughput of the cluster’s IDE
(ATA 100) disks. These changes significantly affected the run time of the parallel codes
as well as the original serial code. The changes for the original serial code range from 5%
to 11% (see Table 1). The resulting speedup is presented in Figure 7. Speedup numbers are
derived by comparing parallel and serial runs. In the case of algorithm speedups, the
comparisons are done with corresponding timings for three MPI processes. For the
parallel runs, the changes range from 29% to 57%, at least triple the corresponding
percentage changes for serial runs.

Number of Generic Sensor Format (GSF) input files  |  L        |  12       |  24       |  48       |  74
Timings of serial code, before BIOS upgrade (s)    |  125.106  |  203.57   |  439.253  |  946.034  |  1607.132
Timings of serial code, with BIOS upgrade (s)      |  118.439  |  192.216  |  412.311  |  893.726  |  1433.211
Percentage change                                  |  5.33%    |  5.58%    |  6.13%    |  5.53%    |  10.82%

Table 1: Timings of change in hard drive performance
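
The percentage changes in Table 1 can be reproduced from the two rows of serial timings as the relative reduction in run time. The sketch below does that arithmetic; the dictionary keys mirror the dataset labels used in the text, and labeling the first (smallest) column "L" is our assumption, since that header cell is not legible in the table.

```python
# Serial timings from Table 1. The "L" key for the first column is assumed.
before = {"L": 125.106, "12": 203.57, "24": 439.253, "48": 946.034, "74": 1607.132}
after = {"L": 118.439, "12": 192.216, "24": 412.311, "48": 893.726, "74": 1433.211}


def percent_reduction(old, new):
    """Relative reduction in run time, expressed as a percentage."""
    return 100.0 * (old - new) / old


changes = {name: percent_reduction(before[name], after[name]) for name in before}
for name, change in changes.items():
    print(f"{name:>2}: {change:.2f}%")
```

The computed values agree with the percentage row of the table to two decimal places.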


7  Summary 

The  optimal  number  of  MPI  processes  for  the  simple  runs  with  large  datasets  is  9.6. 
When  area  filtering  is  turned  on,  the  I/O  node  filters  sounding  data  by  geographic  area. 
Since  area  filtering  is  done  exclusively  by  the  I/O  node  after  all  sounding  data  has  been 
already  written  to  the  PFM  file,  this  processing  adds  to  the  length  of  the  serial  phase  in 
the  overall  processing.  The  serial  processing  nature  of  area  filtering  causes  the  greatest 
reduction  in  performance.  The  results  generated  by  testing  two  of  the  four  types  of 
processing  provide  arguments  for  increasing  the  size  of  the  Beowulf  cluster.  These  two 
processing  types  are:  (a)  processing  with  swath  filtering  enabled,  and  (b)  processing  with 
both  swath  and  area  filtering  enabled.  For  the  largest  test  dataset,  both  processing 
methods  achieved  their  best  speedups  when  running  with  the  maximal  possible  number  of 
MPI  processes.  However,  the  same  data  indicates  the  speedup  increase  would  be  modest. 
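
Why a longer serial phase limits speedup can be illustrated with Amdahl's law, which this paper does not invoke by name. The sketch below is a generic illustration only: the serial fractions and the process count are invented for the example and are not measurements from Pfmloader.

```python
def amdahl_speedup(serial_fraction, workers):
    """Amdahl's law: upper bound on speedup when a fixed fraction of the
    run time is serial and the remainder is split evenly across workers."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical serial fractions; a serial step such as area filtering on the
# I/O node would raise this fraction.
for fraction in (0.02, 0.05, 0.10):
    print(f"serial fraction {fraction:.2f}: "
          f"bound with 12 workers = {amdahl_speedup(fraction, 12):.2f}")
```

Under this model, a serial fraction of 10% caps the speedup below 10 no matter how many processes are added, which is in line with the observation that serial area filtering causes the greatest reduction in performance.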

In all cases, the data strongly suggest that a speedup greater than 10 could be achieved
for larger datasets. This property is a positive characteristic of the current parallel code. It
also strongly suggests that the five test datasets have not yet pushed the current code and
cluster configuration to their limits.